[Arrow]Configure max deduplication length for StringView#8990
[Arrow]Configure max deduplication length for StringView#8990alamb merged 1 commit intoapache:mainfrom
StringView#8990Conversation
59c1e83 to
d029734
Compare
| // the idx is current length of views_builder, as we are inserting a new view | ||
| vacant.insert(self.views_buffer.len()); | ||
| // (2) len > `max_deduplication_len` | ||
| if length > self.max_deduplication_len { |
There was a problem hiding this comment.
In practice I suspect most strings that make sense to deduplicate are relatively short, e.g. <64 bytes.
Pardon, I guess here comes some confusion? The code behaves like min_deduplication_len.
62ce488 to
d0bb6b2
Compare
| /// Default is [`MAX_INLINE_VIEW_LEN`] bytes. | ||
| /// See <https://github.com/apache/arrow-rs/issues/7187> for more details on the implications. | ||
| pub fn with_max_deduplication_len(self, max_deduplication_len: u32) -> Self { | ||
| debug_assert!( |
There was a problem hiding this comment.
Why not leave use assert rather than debug_assert?
We should also document in the comments that the parameter must be greater than zero if we are going to assert
There was a problem hiding this comment.
i look at the with_fixed_block_size function, it use debug_assert:
pub fn with_fixed_block_size(self, block_size: u32) -> Self {
debug_assert!(block_size > 0, "Block size must be greater than 0");
Self {
block_size: BlockSizeGrowthStrategy::Fixed { size: block_size },
..self
}
}
| && value_3.len() < max_deduplication_len.as_usize() | ||
| ); | ||
|
|
||
| let value_checker = |v: &[u8], builder: &StringViewBuilder| { |
There was a problem hiding this comment.
Rather than asserting internal stae of the builder, how about setitng the max deduplicate len to something small (like 1) and then pushing deuplicated strings in
You can then assert that all the values point at distinct offsets in the resulting views
There was a problem hiding this comment.
Since 1 < MAX_INLINE_VIEW_LEN, it will be save and return directly:
pub fn try_append_value(&mut self, value: impl AsRef<T::Native>) -> Result<(), ArrowError> {
let v: &[u8] = value.as_ref().as_ref();
let length: u32 = v.len().try_into().map_err(|_| {
ArrowError::InvalidArgumentError(format!("String length {} exceeds u32::MAX", v.len()))
})?;
if length <= MAX_INLINE_VIEW_LEN {
let mut view_buffer = [0; 16];
view_buffer[0..4].copy_from_slice(&length.to_le_bytes());
view_buffer[4..4 + v.len()].copy_from_slice(v);
self.views_buffer.push(u128::from_le_bytes(view_buffer));
self.null_buffer_builder.append_non_null();
return Ok(());
}
// ...
}There was a problem hiding this comment.
I guess I was thinking something like this which verifies the output (that the strings are not deduplicated) rather than the internal state of the builder:
#[test]
fn test_string_max_deduplication_len() {
let value_1 = "short";
let value_2 = "not so similar string but long";
let value_3 = "1234567890123";
let max_deduplication_len = MAX_INLINE_VIEW_LEN * 2;
let mut builder = StringViewBuilder::new()
.with_deduplicate_strings()
.with_max_deduplication_len(max_deduplication_len);
assert!(value_1.len() < MAX_INLINE_VIEW_LEN.as_usize());
assert!(value_2.len() > max_deduplication_len.as_usize());
assert!(
value_3.len() > MAX_INLINE_VIEW_LEN.as_usize()
&& value_3.len() < max_deduplication_len.as_usize()
);
// append value1 (short), expect it is inlined and deduplicated
builder.append_value(value_1); // view 0
builder.append_value(value_1); // view 1
// append value2, expect second copy is not deduplicated as it exceeds max_deduplication_len
builder.append_value(value_2); // view 2
builder.append_value(value_2); // view 3
// append value3, expect second copy is deduplicated
builder.append_value(value_3); // view 4
builder.append_value(value_3); // view 5
let array = builder.finish();
// verify
let v2 = ByteView::from(array.views()[2]);
let v3 = ByteView::from(array.views()[3]);
assert_eq!(v2.buffer_index, v3.buffer_index); // stored in same buffer
assert_ne!(v2.offset, v3.offset); // different offsets --> not deduplicated
let v4 = ByteView::from(array.views()[4]);
let v5 = ByteView::from(array.views()[5]);
assert_eq!(v4.buffer_index, v5.buffer_index); // stored in same buffer
assert_eq!(v4.offset, v5.offset); // same offsets --> deduplicated
}d0bb6b2 to
86006cc
Compare
| /// Some if deduplicating strings | ||
| /// map `<string hash> -> <index to the views>` | ||
| string_tracker: Option<(HashTable<usize>, ahash::RandomState)>, | ||
| max_deduplication_len: Option<u32>, |
There was a problem hiding this comment.
Will changing self.max_deduplication_len to u32 and setting the default value to MAX_INLINE_VIEW_LEN be better? After this, we can unify the logic in L354 to length < self.max_deduplication_len
| && value_3.len() < max_deduplication_len.as_usize() | ||
| ); | ||
|
|
||
| let value_checker = |v: &[u8], builder: &StringViewBuilder| { |
There was a problem hiding this comment.
I guess I was thinking something like this which verifies the output (that the strings are not deduplicated) rather than the internal state of the builder:
#[test]
fn test_string_max_deduplication_len() {
let value_1 = "short";
let value_2 = "not so similar string but long";
let value_3 = "1234567890123";
let max_deduplication_len = MAX_INLINE_VIEW_LEN * 2;
let mut builder = StringViewBuilder::new()
.with_deduplicate_strings()
.with_max_deduplication_len(max_deduplication_len);
assert!(value_1.len() < MAX_INLINE_VIEW_LEN.as_usize());
assert!(value_2.len() > max_deduplication_len.as_usize());
assert!(
value_3.len() > MAX_INLINE_VIEW_LEN.as_usize()
&& value_3.len() < max_deduplication_len.as_usize()
);
// append value1 (short), expect it is inlined and deduplicated
builder.append_value(value_1); // view 0
builder.append_value(value_1); // view 1
// append value2, expect second copy is not deduplicated as it exceeds max_deduplication_len
builder.append_value(value_2); // view 2
builder.append_value(value_2); // view 3
// append value3, expect second copy is deduplicated
builder.append_value(value_3); // view 4
builder.append_value(value_3); // view 5
let array = builder.finish();
// verify
let v2 = ByteView::from(array.views()[2]);
let v3 = ByteView::from(array.views()[3]);
assert_eq!(v2.buffer_index, v3.buffer_index); // stored in same buffer
assert_ne!(v2.offset, v3.offset); // different offsets --> not deduplicated
let v4 = ByteView::from(array.views()[4]);
let v5 = ByteView::from(array.views()[5]);
assert_eq!(v4.buffer_index, v5.buffer_index); // stored in same buffer
assert_eq!(v4.offset, v5.offset); // same offsets --> deduplicated
}| // (2) len > `MAX_INLINE_VIEW_LEN` and len < `max_deduplication_len` | ||
| let can_deduplicate = match self.max_deduplication_len { | ||
| Some(max_deduplication_len) => length <= max_deduplication_len, | ||
| None => true, | ||
| }; | ||
| if can_deduplicate { |
There was a problem hiding this comment.
It is minor, but as written this code will check the deduplicate length on each row, even if we are not deduplicating
How about only checking after the call to string_tracker.take()?
something like
// Deduplication if:
// (1) deduplication is enabled.
// (2) len > `MAX_INLINE_VIEW_LEN` and len < `max_deduplication_len`
if let Some((mut ht, hasher)) = self.string_tracker.take() {
if self
.max_deduplication_len
.map(|max_len| length > max_len)
.unwrap_or(false)
{
let hash_val = hasher.hash_one(v);
let hasher_fn = |v: &_| hasher.hash_one(v);
let entry = ht.entry(
hash_val,
|idx| {
let stored_value = self.get_value(*idx);
v == stored_value
},
hasher_fn,
);
match entry {
Entry::Occupied(occupied) => {
// If the string already exists, we will directly use the view
let idx = occupied.get();
self.views_buffer.push(self.views_buffer[*idx]);
self.null_buffer_builder.append_non_null();
self.string_tracker = Some((ht, hasher));
return Ok(());
}
Entry::Vacant(vacant) => {
// o.w. we insert the (string hash -> view index)
// the idx is current length of views_builder, as we are inserting a new view
vacant.insert(self.views_buffer.len());
}
}
}
self.string_tracker = Some((ht, hasher));
}There was a problem hiding this comment.
In the code you provided, every time it will take string_tracker out and store it, even though v.len > max_deduplication_len, i think saving string_tracker price is more than v.len > max_deduplication_len
There was a problem hiding this comment.
what I am worried about is slowing down the case when string deduplication is not enabled. I will run some benchmarks to make sure this change doesn't affect performance
There was a problem hiding this comment.
in the new pr, i change the code as:
let can_deduplicate = if self.string_tracker.is_some() {
match self.max_deduplication_len {
Some(max_deduplication_len) => length <= max_deduplication_len,
None => true,
}
} else {
false
};86006cc to
46b786a
Compare
|
run benchmark view_types concatenate_kernels |
| let can_deduplicate = if self.string_tracker.is_some() { | ||
| match self.max_deduplication_len { | ||
| Some(max_deduplication_len) => length <= max_deduplication_len, | ||
| None => true, | ||
| } | ||
| } else { | ||
| false | ||
| }; |
There was a problem hiding this comment.
Here is another possibly clearer way to express this same logic:
let can_deduplicate = self.string_tracker.is_some()
&& self
.max_deduplication_len
.map(|max_length| length <= max_length)
.unwrap_or_default();| } else { | ||
| false | ||
| }; | ||
| if can_deduplicate { |
There was a problem hiding this comment.
You can avoid the extra indent layer by combining the statements together I think:
Instead of
if can_deduplicate {
if let Some((mut ht, hasher)) = self.string_tracker.take() {This:
if can_deduplicate && let Some((mut ht, hasher)) = self.string_tracker.take() {StringView
251ed58 to
9610686
Compare
|
show benchmark queue |
|
🤖 Hi @alamb, you asked to view the benchmark queue (#8990 (comment)).
|
|
Oh no, I think #8990 (comment) broke the MSRV check -- we'll have to back that out. Sorry @lichuang |
…while building the array
9610686 to
d058cb4
Compare
|
🤖 |
|
🤖: Benchmark completed Details
|
|
🤖 |
|
🤖 |
|
🤖: Benchmark completed Details
|
|
🤖 |
|
🤖: Benchmark completed Details
|
|
Benchmark results look good to me. Let' go! |
Configure max deduplication length when deduplicating strings while building the array
Which issue does this PR close?
Configure max deduplication length when deduplicating strings while building the array
Rationale for this change
Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.
What changes are included in this PR?
There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
Are these changes tested?
We typically require tests for all PRs in order to:
If tests are not included in your PR, please explain why (for example, are they covered by existing tests)?
Are there any user-facing changes?
If there are user-facing changes then we may require documentation to be updated before approving the PR.
If there are any breaking changes to public APIs, please call them out.